Assessment of small strain modulus in soil using advanced computational models

Small-strain shear modulus (\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$G_0$$\end{document}G0) of soils is a crucial dynamic parameter that significantly impacts seismic site response analysis and foundation design. \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$G_0$$\end{document}G0 is susceptible to multiple factors, including soil uniformity coefficient (\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$C_u$$\end{document}Cu), void ratio (e), mean particle size (\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$d_{50}$$\end{document}d50), and confining stress (\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\sigma '$$\end{document}σ′). This study aims to establish a \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$G_0$$\end{document}G0 database and suggests three advanced computational models for \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$G_0$$\end{document}G0 prediction. Nine performance indicators, including four new indices, are employed to calculate and analyze the model’s performance. The XGBoost model outperforms the other two models, with all three models achieving \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$R^2$$\end{document}R2 values exceeding 0.9, RMSE values below 30, MAE values below 25, VAF values surpassing 80%, and ARE values below 50%. Compared to the empirical formula-based traditional prediction models, the model proposed in this study exhibits better performance in IOS, IOA, a20-index, and PI metrics values. The model has higher prediction accuracy and better generalization ability.

The dynamic shear modulus (G) of the soil is a critical parameter in seismic site response analysis and dynamic foundation design.Its value decreases as the shear strain amplitude ( γ a ) increases.When the γ a is less than 10 −6 , G is referred to as the small strain shear modulus ( G 0 ).Many researchers have conducted studies on G 0 for sandy soils using cyclic triaxial tests, bending element tests, and resonant column tests.They have analyzed the influencing factors and proposed predictive models.
Resonant column experiments were employed to investigate the influencing factors on the G 0 of pure silica sand.The results indicate that G 0 exhibits a power-law growth relationship with increasing the confining stress ( σ ′ ) values.At constant σ ′ values, G 0 decreases with the increase of the void ratio (e) 1-3 .Similar findings have been observed in the studies conducted by multiple scholars [4][5][6][7][8][9] .The Hardin model is the most commonly used G 0 prediction model 1,3 : where A and n are fitting parameters, f(e) is the function of e, and P a is reference stress, usually atmosphere pressure (i.e., 100 kPa).
Particle size distribution is one of the most important factors influencing the G 0 of sandy soils.Existing research studies often consider the uniformity coefficient ( C u ) and the median particle size ( d 50 ) as characteristic parameters of the particle size distribution when conducting a comprehensive analysis of the effect of particle size distribution on G 0 .
The impact of particle gradation on the G 0 of sandy soils was investigated through resonant column experiments 10 .The findings reveal that G 0 increases with the rise of C u and d 50 when relative density ( D r ) appear similar, with d 50 exerting more prominent influence compared to C u .Furthermore, in the Hardin model, the parameter n escalates with the increase of C u .In resonant column frequency sweep experiments conducted on uniformly graded natural sand, when d 50 is less than 1.8 mm, G 0 gradually increases with the enlargement of d 50 11 .Experimental investigations were conducted using the same methodology on silica sands with vary- ing gradations 12 .The results reveal that when d 50 remains constant for both e and σ ′ , its variation does not (1) Transportation Institute, Inner Mongolia University, Hohhot 010021, China.significantly affect G 0 .However, an increase in C u results in a decrease of G 0 .Similar results have been observed in resonant column and bending element experiments with uniformly sized glass bead powders of different particle sizes 7 : when e remains constant, although G 0 decreases with increasing d 50 , this decrease can be con- sidered negligible.The diversity of soil types and experimental methods has led to variations in predictive models for G 0 .As a result, predicted results for different types of sandy soils often show significant differences when compared to actual measured data.Machine learning algorithms have the ability to model data based on a large amount of experimental data and adaptively adjust model parameters according to different soil characteristics.This approach allows for a more comprehensive consideration of various influencing factors, ultimately improving the comprehensiveness and accuracy of predictive models.Several researchers have already used machine learning algorithms to build predictive models in this regard.
Probabilistic machine learning has been integrated into predictions of desiccation cracks with uncertain input 13 .Six different machine learning algorithms were analyzed, and it was found that the enhanced XGBT model had greater predictive capacity.Additionally, multivariate robust regression analysis and exponential smoothing time-series analysis were used to address zeolite data, resulting in the development of a novel prediction process that correlates influential variables of zeolite-alkali activated sand 14 .The ANFSI, MLP, and MRM machine learning algorithms were used to develop predictions for G 0 and minimum damping ratio ( D min ) of sand containing rubber particles 15 .The input parameters for these models are σ ′ , fiber content, and rubber con- tent.Neural networks with 3-3-1 and 3-2-1 architectures showed good performance in predicting the damping ratio(D) and G of mica-sand mixtures 16 .When the Backpropagation Neural Net-work(BPNN) model was used to predict the G and D of marine clay 17 , the established model showed good predictive performance for marine clay at different strains and depths.In addition, a novel genetic expression programming model was used to predict the normalized shear modulus and damping ratio of sandy soils 18 .
Currently, traditional empirical formulas for predicting G 0 are constrained by the limitations imposed by the types of soils in the experimental data and the variety of experimental methods, resulting in limited generalization performance of prediction models.In contrast, machine learning algorithms have demonstrated excellent performance in handling large data sets and predicting soil performance indicators.
In this study, G 0 data from several published papers are used to build a G0 database for sandy soils.Three machine learning algorithms-BPNN, Genetic Algorithm-enhanced BP Neural Network (GA-BP), and Extreme Gradient Boosting algorithm (XGBoost) for their strong performance in soil prediction are selected to build prediction models.The input features for these models include soil grading characteristics, such as C u and d 50 , and state parameters, such as σ ′ and e.The goal is to obtain accurate predictions of the G 0 for different types of sandy soils.A comparative analysis is performed to evaluate the predictive performance of the three machine learning models for G 0 in sandy soils, and a comparison with traditional empirical relationship models is made to validate the effectiveness and accuracy of the machine learning models.

Data collection and data analysis
In this study, a dataset of 1966 sets of G 0 values was obtained from 13 different literatures using various experi- mental methods such as resonant column tests, bending element tests, and torsional shear tests 6,12,[19][20][21][22][23][24][25][26][27][28][29] .The dataset includes a wide range of soil types and particle size distributions, including 481 sets of coral sand, 675 sets of silica sand, and 96 sets of sandy gravel obtained using three different sampling techniques.
According to the above factors affecting G 0 of the soil, C u and d 50 , which can reflect the grading character- istics, e, which reflects the density state, and σ ′ , which reflects the stress state, are selected as input parameters in this paper.Supplement Table S1 provides essential information about the experiments conducted in each referenced source.Soil classification was performed according to ASTM 2487 standards to achieve uniformity.Specifically, soils with poor particle size distribution and those containing poorly graded gravels were designated SP.Sands with poor grain size distribution and the presence of silt were designated SP-SM.Sands with a well-graded particle size distribution and silt content were labeled SW-SM.Pure silt was classified as ML.Well graded gravels were identified as GW, while poorly graded gravels were identified as GP based on their particle size distribution and characteristics.
Each database has numerous data points spread across numerous rows and columns, which makes it challenging to comprehend.Descriptive statistics are thus generated for the database.In the present research, descriptive statistical parameters, mean, SE mean, StDev, variance, coefficient of variance, minimum, Q1, median, Q3, maximum, IQR, skewness and kurtosis have been calculated for overall, training, testing databases, as mentioned in Table 1 [30][31][32] .Table 1 demonstrates that the overall database contains a number of C u , d 50 , e, σ ′ and G 0 in the range of 1.2-65.49,0.02-10, 0.25-5.19,20-700, 12.36-460.79.As a result, Fig. 1 depicts the frequency distribution of the database's C u , d 50 , e, σ ′ and G 0 variables.Before utilizing the database for training and testing purposes, the database has been preprocessed, and missing data and outliers have been removed and normalized by the min-max normalization function = (x-min)/(max-min), where x is the actual value.

Backpropagation neural network (BPNN) model
Several writers have employed the BPNN modelling approach 33,34 , BPNN is a highly efficient and widely used artificial neural network model that consists of three main layers: the input layer, the hidden layer, and the output layer.Neurons in the input layer can directly interact to receive and process data, while the output layer is responsible for visualizing the results.The neurons in the hidden layer, while not directly visible, play a critical role in processing and transforming the information.Figure 2 shows a schematic of a three-layers BPNN.The neural network consists of three layers: the input layer represented by a m , the hidden layer represented by b u , and the output layer represented by c n .Let w mn denote the connection weight between the m th neuron in the input layer and the uth neuron in the hidden layer, and let v mn denote the connection weight between the u th neuron in the hidden layer and the nth neuron in the output layer.Thus, we can obtain the expressions for the neurons in the hidden layer and the output layer as given in Eqs. ( 2) and (3): the excitation function is the Sigmoid function, k u is the threshold of the hidden layer neuron, p n is the threshold of the output layer neuron.Each time the output value is compared with the desired output, if the mean square error does not meet the predetermined requirements, the back-propagation process is carried out and the mean square error is returned in the form of a gradient and assigned to each layer neuron.The process is repeated until the mean square error converges.The "Genetic monitoring" mechanism involves determining the occurrence of genetic degradation by comparing the maximum individual fitness values between the (n + 1) th and nth generations of a population.Let the maximum individual fitness value of the nth generation be denoted as Infitmax (n) .If the population fitness satisfies Eq. ( 4), it is identified as genetic degradation.In such cases, the "Life-death individual alternation" mechanism is activated to increase the diversity of individuals within the population, thereby reducing the probability of genetic degradation.If genetic degradation is not effectively controlled throughout the process, the individual with the highest fitness in the current genetic process is selected as the final outcome.
The formula for calculating individual fitness Infit (i) is given in Eq. ( 5).
where p is the number of output neurons; y i is the desired output of the ith output neuron; o i is the network output of the ith output neuron.
The "Life-death individual alternation" mechanism involves identifying individuals in the offspring population whose fitness is lower than the average fitness of that population as "deceased individuals".These "deceased individuals" are then replaced by an equal number of randomly generated "newborn individuals".The purpose of this mechanism is to preserve the best performing individuals in the current population, while introducing new individuals to prevent the population from becoming less diverse.It helps the algorithm to break out of the current genetic degradation deadlock and prevents it from getting stuck in a "local optimum".This mechanism is designed to maintain a balance between preserving promising individuals and introducing new genetic material to increase the genetic diversity of the population.
The GA-BP model uses a genetic algorithm during the training process, which uses selection, crossover and mutation operations to perpetuate genetic information.It continuously activates the BP neural network to calculate fitness values for each generation of the population.Throughout this process, the 'Genetic monitoring mechanism' and the 'Life-death individual alternation mechanism' come into play until the genetic constraints are met.The individual obtained through genetic evolution is then compared with the individual with the highest fitness value throughout the genetic process.The one with the higher fitness is decoded and assigned to the (4) www.nature.com/scientificreports/BP neural network for local optimisation.This process continues until the network's output meets the required error limits and other criteria, at which point the algorithm is completed.

Extreme gradient boosting(XGBoost) model
The Extreme Gradient Boosting algorithm(XGBoost), is an ensemble machine learning algorithm based on decision trees that works within the gradient boosting framework.This algorithm combines several basic learners to create a powerful model.XGBoost is an efficient implementation of the Gradient Boosting Decision Trees (GBDT) algorithm and optimises various aspects of GBDT, including the objective function, the optimisation method, the handling of missing values and the prevention of overfitting 35,36 .When XGBoost is running, it first trains a basic learner on the training data set.The results produced by this learner are then adjusted based on the training samples, and the next learner is trained using these adjusted samples.This iterative process continues for several rounds until the number of base learners reaches a predefined value.Finally, all the base learners are combined and the computation proceeds as follows: (1) Constructing Boosting Models: where F is the set of all regression trees.
(2) Training objective function: where " n i=1 l(y i , y (t+1) + f t (X i )) " is the loss function and " �(f t ) " is the sum of all regularisation terms.(3) The loss function is subjected to a second order Taylor expansion and the final objective function is obtained:

Performance evaluation
The performance of the machine learning models was assessed using several metrics.The mathematical formulation of the performance metrics is as follows [37][38][39] : Coefficient of determination ( R 2 )

Absolute relative error(ARE)
Root Mean Square Error (RMSE)

Mean Absolute Error (MAE)
Variance Accounted For (VAF) where y i is the measured value, ȳi is the mean value of y i , ŷi is the predicted value of y i , m20 is the ratio of experi- mental to the predicted value (0.8-1.2), H is the total number of data samples, and n is the number of predicted samples.perfectpredictive model always has a performance equal to the ideal value mentioned in Table 2.

Sensitivity analysis
Sensitivity analysis identifies variables that will most significantly impact predictions.There are global and local forms of sensitivity analysis.Various techniques are used to conduct sensitivity analysis, including the cosine amplitude technique applied in this study.The mathematical expression for the cosine amplitude method is 40,41 : where X ic is input parameters C u , d 50 , e, and σ ′ , and X jk is output parameter G 0 .A strong influencing input variable always has a SS value near one.In this study, 666 data points have been collected from the field.The sensitivity analysis result has been drawn, as depicted in Fig. 3. Figure 3 illustrates that the input parameters, such as C u , d 50 , e, and σ ′ , highly influence G 0 prediction.Com- paring all input variables, the C u variable (= 0.59) influences the G 0 prediction less than other input variables.

Modeling
Three methods BPNN,GA-BP and XGB were used to model the prediction of G 0 .The input parameters for these models included four soil-related factors: C u , d 50 , e and σ ′ .The output parameter selected for prediction was G 0 .Around a third of the data, 666 data points in total, were chosen from the database to be the training set, while the remaining 1300 data points were designated as the test set.The minimum and maximum values and other descriptive statistics of the data in the training set align well with the database and accurately represent the data contained within.
Both the BPNN and GA-BP models used 12 hidden layer neurons, the learning functions are all chosen as trainlm functions, and the transfer functions between the input layer and the implicit layer, and between the implicit layer and the output layer are all chosen as Sigmoid functions.In the case of the XGBoost model, a regression model using XGBRegressor was used.This model was configured with a maximum tree depth of 6, 100 base learners and a learning rate set to 0.1, the rest of the parameters are default values.
Figure 4 illustrates the comparison between the predicted values of G 0 by different prediction models and the actual measured values on the training set.It can be seen that each model can accurately predict the dynamic shear modulus of the soil during training.
Furthermore, the data points for all three models are evenly distributed on both sides of the 45 degree line, indicating reasonably good predictions.Of the models, the XGBoost model shows the best training performance with R 2 of 0.99.The ARE is consistently below 10%, with 98% of data points having ARE values below 5%.The GA-BP model comes next with R 2 of 0.97 and approximately 67% of the data points have ARE values below 25%.However, the BPNN model has the worst training performance with an R 2 of 0.94.

Model performance analysis
Figure 5 shows a comparison of the predictions made by different prediction models for the G 0 on the test data set with the actual measured values.Table 3 shows the performance metrics for the different models.In the figure, the scatter points represent the model predictions for the samples in the test data set.The figure shows that the scatter points for all three models are evenly distributed around the diagonal line (Y = X), indicating that the models have achieved good predictive performance for G 0 .
The model with the highest prediction accuracy is the XGBoost model, closely fol-lowed by the GA-BP model, which has slightly lower prediction performance compared to XGBoost.On the other hand, the BPNN model has the worst prediction performance.As can be seen in Fig. 5a, the scatter points of the predictions of the BPNN model are more spread out, indicating less accuracy in its predictions.
Figure 6 shows the distribution of the ARE for the predictions made by the different models for each dataset in the database compared to the actual measured values.It can be observed that the ARE for all three models is mainly concentrated below 40%.
The XGBoost model has relatively low ARE values, indicating excellent predictive performance.The GA-BP model slightly outperforms the BPNN model in terms of prediction.As shown in Table 3, the evaluation metrics for the XGBoost model are consistently superior to those of the BPNN and GA-BP models.In summary, these results suggest that the XGBoost model performs best in predicting the G 0 , while the BPNN model has inferior prediction performance and the GA-BP model also provides good predictions for G 0 .

Model performance evaluation
G 0 is primarily determined by experiment and empirical equations.Experimental measurements of G 0 are subject to many influences, making it difficult to determine its value effectively.Empirical formulae have the advantage of being simple to calculate and easy to use.
In order to evaluate the performance of the computational models, the G 0 prediction models proposed by Wichtmann 12 and Liu et al. 42 for the calculation of G 0 of the soil were selected.Liu et al. 42 provided an empirical relationship formula for G 0 in sandy soils, which takes into account the influence of particle size distribution, based on experimental results of quartz sand and volcanic sand: Wichtmann and Triantafyllidis based on the results of the silica sand G 0 tests, the similar G 0 prediction equations are given taking into account the effect of grading:  where C u is the uniformity coefficient, e is the void ratio, σ ′ is the enclosing pressure, and P a is reference stress, usually atmosphere pressure (i.e., 100 kPa).The above two mentioned empirical formulas were used to predict G 0 for different types of soils in the data- base, the comparison between the predicted results and the actual values is shown in Fig. 7.The performance metrics for the traditional empirical formula models and the computational models in predicting the various soil types in the database are summarised in Table 4.
From Fig. 7. it can be seen that the models proposed by Liu' s and Wichtmann's models show significant deviations from the actual values of G 0 in the database.Liu 's model shows that 84.09% of the predicted results have an ARE of less than 50%, while Wichtmann and Triantafyllidis's model has 63.55% of the results with ARE values of less than 50%.The predicted data are widely scattered, indicating a lower prediction accuracy.
Table 4 shows that the BPNN model, the GA-BP model and the XGBoost model all have R2 greater than 0.95, RMSE less than 30, MAE less than 25 and VAF greater than 80%.In contrast, the traditional empirical formulas provide lower performance metrics across all indicators compared to the three artificial intelligence algorithms, indicating poorer predictive performance.
In summary, the XGBoost model has the highest prediction accuracy for predicting soil G 0 , followed by the GA-BP model and the BPNN model, while the empirical formula models have the lowest prediction accuracy.Therefore, when performing G 0 calculations for soil, it is advisable to select the XGBoost model for estimation.

( 2 )
b u =f u w mu b u + p n (3) c n =f n w mu b u + p n

Figure 3 .
Figure 3. Illustrations of sensitivity analysis for G 0 .

Figure 4 .
Figure 4. Comparison of the results predicted by different models on the training set with the actual results: (a) BPNN model (b) GA-BP model (c) XGBoost model.

Figure 5 .
Figure 5.Comparison of the results predicted by different models on the testing dataset with the actual results: (a) BPNN model (b) GA-BP model (c) XGBoost model.

Figure 7 .
Figure 7.Comparison of predicted and actual values of two traditional empirical formulas: (a) Liu et al' s model (b) Witchtmann and Triantafyllidis's model.

Table 1 .
Results of multicollinearity analysis for the complete database.

Table 2 .
Its predictions are less accurate, as only 55% of the data points in the training sets have ARE values below 25%.Consequently, Ideal value of the different performance indicators.the predictive performance of the BPNN model is comparatively inferior to the other two models.Overall, the XGBoost model shows the highest accuracy in predicting G 0 of the training set.